and, hence, that appears as conserved regions in a group of evolutionarily related
gene sequences. This is not a strong definition, not least because the motif concept
is based on a mosaic view of the genome that is opposed to the more realistic (but
less tractable) systems view.
The construction of the concise descriptions could be either deductive or inductive.
A difficulty is that extant natural genomes are not elegantly designed from scratch,
but assembled ad hoc, and refined by “life experience” (of the species). The use of
fuzzy criteria may help to overcome this problem.
In practice, intrinsic methods often boil down to either computing one or more
parameters from the sequence and comparing them with the same parameters computed
for sequences of known function, or searching for short sequences that experience
has shown are characteristic of certain functions.
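The first of these two strategies can be illustrated with a minimal sketch. It assumes, purely for illustration, that the parameter is the G+C fraction and that "comparison" means picking the functional class whose reference sequences have the nearest mean value; the function names and the toy reference sets are hypothetical, and a real method would combine several parameters and use a proper statistical test.

```python
from typing import Dict, List


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (one simple parameter)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def closest_class(query: str, references: Dict[str, List[str]]) -> str:
    """Return the label whose reference sequences have the mean GC content
    nearest to that of the query (an illustrative decision rule only)."""
    q = gc_content(query)

    def mean_gc(seqs: List[str]) -> float:
        return sum(gc_content(s) for s in seqs) / len(seqs)

    return min(references, key=lambda label: abs(mean_gc(references[label]) - q))


# Toy reference sets, invented for the example (not real annotated sequences).
toy_references = {
    "high_gc_class": ["GGCC", "GCGC"],
    "low_gc_class": ["ATAT", "AATT"],
}
label = closest_class("GGGCAT", toy_references)
```

A single scalar such as the G+C fraction discriminates poorly on its own; the sketch is meant only to show the shape of the compute-then-compare procedure.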
17.5.1 Signals
In the context of intrinsic methods for assigning a function to DNA, the term “signal”
denotes a short sequence relevant to the interaction of the gene expression machinery
with the DNA. In effect, one is paralleling the action of the cell (e.g., the transcription,
splicing, and translation operations) by trying to recognize where the gene expression
machinery interacts with DNA. In a sense, therefore, this topic belongs equally well to
interactomics (Chap. 23). Much use has been made of so-called consensus sequences,
which are formed from sequences well conserved over many species by taking the
most common base at each position. The distance (e.g., the Hamming distance) of
an unknown sequence from the consensus sequence is then computed; the closer
they are, the more likely it is that the unknown sequence has the same function as
that represented by the consensus sequence. Useful signals include start and stop
codons (Table 7.1). More sophisticated signals include sequences predicted to result
in unusual DNA bendability or known to be involved in positioning DNA around
histones, intron splice sites in eukaryotic pre-mRNA, and sequences corresponding
to ribosome binding sites on RNA.
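The consensus procedure described above can be sketched as follows. The sketch assumes that the conserved sequences have already been aligned to equal length without gaps, and that an unknown sequence is scanned window by window for the position closest to the consensus; the function names and the four toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import List, Tuple


def consensus(aligned: List[str]) -> str:
    """Most common base at each position of equal-length aligned sequences."""
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must have equal length")
    # zip(*aligned) yields one column (position) at a time.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_match(seq: str, cons: str) -> Tuple[int, int]:
    """Return (position, distance) of the window of seq closest to cons."""
    k = len(cons)
    if len(seq) < k:
        raise ValueError("sequence shorter than consensus")
    dist, pos = min((hamming(seq[i:i + k], cons), i)
                    for i in range(len(seq) - k + 1))
    return pos, dist


# Toy aligned sequences, invented for the example.
toy_sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
cons = consensus(toy_sites)
hit = best_match("GGCTATAATGC", cons)
```

Taking only the most common base discards the information about how strongly each position is conserved; position weight matrices are the usual refinement, but they go beyond what is described here.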
Special effort has been devoted to identifying promoters, which are of great interest
as potential targets for new drugs. Identifying them is hard because of the large and variable
distances between the promoter(s) and the sequence to be transcribed. The approach
relies on relatively well-conserved sequences (i.e., effectively consensus sequences)
such as TATA or CCAAT. Other sites for protein–DNA interactions can be examined
in the same way; indeed, the entire transcription factor binding site can be included
in the prototype object, which allows more sophistication (e.g., some constraints
between the sequences of the different parts) to be applied.
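A constraint between the parts of such a prototype can be illustrated with a minimal sketch that looks for a CCAAT element followed, within a permitted spacing, by a TATA element. The spacing bounds `min_gap` and `max_gap` are placeholder values chosen for the example, not biologically established distances, and exact string matching stands in for the distance-based matching a real method would use.

```python
import re
from typing import List, Tuple


def find_all(seq: str, motif: str) -> List[int]:
    """Start positions of every (possibly overlapping) exact occurrence of motif."""
    # A lookahead matches without consuming, so overlapping hits are reported.
    return [m.start() for m in re.finditer("(?=" + re.escape(motif) + ")", seq)]


def paired_sites(seq: str, upstream: str = "CCAAT", downstream: str = "TATA",
                 min_gap: int = 10, max_gap: int = 60) -> List[Tuple[int, int]]:
    """Pairs (upstream_start, downstream_start) whose separation, measured
    from the end of the upstream motif to the start of the downstream one,
    lies between min_gap and max_gap bases (illustrative bounds only)."""
    seq = seq.upper()
    downstream_hits = find_all(seq, downstream)
    pairs = []
    for u in find_all(seq, upstream):
        for d in downstream_hits:
            gap = d - (u + len(upstream))
            if min_gap <= gap <= max_gap:
                pairs.append((u, d))
    return pairs


# Toy sequence, invented for the example: CCAAT, 20 spacer bases, then TATAAA.
toy_seq = "GG" + "CCAAT" + "G" * 20 + "TATAAA" + "CC"
pairs = paired_sites(toy_seq)
```

Requiring both elements at a plausible spacing reduces the number of chance matches that either short motif would produce on its own, which is the kind of added sophistication the composite prototype permits.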